Behavioral Simulation of Drosophila Larvae

Data Engineering, DOE Design, and Event-Hazard Modeling

Gil Raitses

2025-10-30

Overview

This presentation describes a simulation framework for Drosophila larvae built from empirical tracking data. The project develops event-hazard models to predict behavioral events (turns, stops, reversals) as functions of stimulus intensity, pulse duration, and inter-pulse interval, enabling simulation-based prediction of behavioral metrics through a design-of-experiments (DOE) framework.

The methodology connects simulation practice with biological data analysis, demonstrating how stochastic events occur as functions of external stimuli and internal state. Results are reported using outputs including key performance indicators (KPIs), across-replications summaries, and confidence intervals, analogous to manufacturing system analysis.

The Pipeline

```{mermaid}
graph LR
    A[H5 Files] --> B[Data Engineering]
    B --> C[Feature Extraction]
    C --> D[Hazard Model Fitting]
    D --> E[DOE Simulation]
    E --> F[KPI Analysis]
```

Data Engineering Process

From Raw Tracking Data to Simulation-Ready Features

The data engineering pipeline transforms raw H5 files containing larval trajectory data into simulation-ready features. Input data includes position coordinates (x, y) sampled at a fixed frame rate, head/mid/tail coordinates providing body orientation, and synchronized LED stimulus timing with intensity values.

The processing script engineer_dataset_from_h5.py performs five sequential operations. First, it extracts trajectory features including speed, angle, and curvature from position data. Second, it detects behavioral events using threshold-based criteria, identifying runs, turns, and pauses based on speed, curvature, and body bend angle thresholds. Third, it applies MAGAT segmentation to classify trajectory segments into discrete behavioral states. Fourth, it generates the Klein run table, a standardized 18-column structure organizing run and turn events with derived metrics. Fifth, it extracts stimulus features using temporal kernel bases within integration windows relative to stimulus onsets.

\tech{Data Engineering Pipeline Steps}
| Processing Step | Description | Output Features |
|---|---|---|
| 1. Feature Extraction | Speed, angle, curvature calculation | ~50 spatial/temporal features |
| 2. Event Detection | Turn, pause, reversal identification | Event timestamps and classifications |
| 3. MAGAT Segmentation | Run segment classification | Run/turn/pause segments |
| 4. Klein Run Table | 18-column standardized structure | Run table with derived metrics |
| 5. Stimulus Features | Temporal kernel basis extraction | Stimulus-locked feature vectors |

The output consists of an engineered CSV dataset containing approximately 100 features per trajectory, including spatial features (position, speed, acceleration, curvature, body bend angle, wall distance), temporal features (time since last event, stimulus history, integration window features), event classifications (run segments, turn events, pause events, heading reversals, reverse crawling), and stimulus features (LED intensity PWM values, pulse duration, inter-pulse interval, time since onset).
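As an illustration of the first processing step, the sketch below derives speed, heading angle, and path curvature from raw (x, y) positions. The function name and the frame-rate parameter are assumptions for the example; the actual script computes many more features.

```python
import numpy as np

def trajectory_features(x, y, fps=10.0):
    """Sketch of deriving speed, heading, and curvature from positions.

    `fps` (frames per second) is an assumed parameter, not a value
    stated in the project; the real engineer_dataset_from_h5.py
    produces ~50 spatial/temporal features.
    """
    dt = 1.0 / fps
    dx, dy = np.gradient(x, dt), np.gradient(y, dt)
    speed = np.hypot(dx, dy)                 # distance units per second
    angle = np.unwrap(np.arctan2(dy, dx))    # heading in radians, unwrapped
    # Path curvature: rate of heading change per unit arc length.
    curvature = np.gradient(angle, dt) / np.maximum(speed, 1e-9)
    return speed, angle, curvature
```

For straight-line motion the curvature is zero everywhere, which makes a convenient sanity check on the derivative conventions.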

Key Data Features

Behavioral Features Extracted

The engineered dataset captures multiple feature categories essential for hazard model fitting. Spatial features include position coordinates (x, y), speed and acceleration calculated from position derivatives, curvature derived from path geometry, body bend angle computed from head-mid-tail orientation, and distance from arena walls. Temporal features track time since last behavioral event, stimulus history vectors, and features extracted within integration windows centered on stimulus onsets.

\tech{Feature Categories in Engineered Dataset}
| Feature Category | Example Features | Approximate Count |
|---|---|---|
| Spatial Features | Position (x, y), speed, acceleration, curvature, body bend angle, wall distance | ~30 features |
| Temporal Features | Time since last event, stimulus history, integration window features | ~25 features |
| Event Detection | Run segments (MAGAT), turn events, pause events, heading reversals, reverse crawling (Klein) | ~20 features |
| Stimulus Features | LED intensity (PWM), pulse duration, inter-pulse interval, time since onset | ~25 features |

Event detection identifies run segments using MAGAT segmentation, turn events marking reorientation transitions, pause events indicating periods of low speed, heading reversals representing discrete large-angle turns, and reverse crawling detected using the Klein method analyzing movement-orientation angle relationships. Stimulus features encode LED intensity as PWM values (250, 500, 1000), pulse duration in seconds, inter-pulse interval timing, and time since stimulus onset within integration windows.

DOE Design

Factorial Design: 3 Factors × Multiple Levels

The DOE employs a factorial structure with three factors at multiple levels. Intensity varies across three PWM values (250, 500, 1000) representing threshold, moderate, and high stimulus levels. Pulse duration spans five levels (10 s, 15 s, 20 s, 25 s, 30 s) covering short to extended stimulus presentations. Inter-pulse interval includes three levels (5 s, 10 s, 20 s) representing rapid, moderate, and spaced stimulus timing.

\tech{DOE Factor Structure}
| Factor | Levels | Description | Design Contribution |
|---|---|---|---|
| Intensity | 3 (PWM 250, 500, 1000) | LED stimulus intensity as PWM percentage | 45 conditions |
| Pulse Duration | 5 (10 s, 15 s, 20 s, 25 s, 30 s) | Duration of each stimulus pulse | 45 conditions |
| Inter-Pulse Interval | 3 (5 s, 10 s, 20 s) | Time between consecutive pulse onsets | 45 conditions |

The complete design encompasses 45 conditions total (3 × 5 × 3), with 30 replications per condition yielding 1350 total simulations. Estimated runtime follows from the observed simulation rate per replication. This design structure covers biologically relevant parameter space, enables main effects and interaction analysis through factorial combinations, and provides statistical power for validation through multiple replications per condition.
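The full factorial grid can be enumerated directly from the factor tables above; a minimal sketch (the condition dictionary keys are illustrative, not the project's actual schema):

```python
from itertools import product

# Factor levels taken from the DOE table: PWM intensity, pulse
# duration (s), and inter-pulse interval (s).
INTENSITY = [250, 500, 1000]
PULSE_DURATION = [10, 15, 20, 25, 30]
INTER_PULSE = [5, 10, 20]
REPLICATIONS = 30

# Full factorial: every combination of the three factors.
conditions = [
    {"intensity": i, "pulse_s": d, "ipi_s": g}
    for i, d, g in product(INTENSITY, PULSE_DURATION, INTER_PULSE)
]

n_conditions = len(conditions)                 # 3 x 5 x 3 = 45
total_runs = n_conditions * REPLICATIONS       # 45 x 30 = 1350 simulations
```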

Simulation Process

How the Simulation Works

The simulation execution follows a four-step process. Step 1 loads fitted models including the reorientation hazard model, pause hazard model, and heading reversal hazard model, along with learned event parameters defining detection thresholds. Step 2 iterates through each DOE condition, setting the stimulus schedule (intensity, duration, interval) and initializing trajectory state (position, speed, angle).

\tech{Simulation Execution Process}
| Step | Process | Technical Details |
|---|---|---|
| 1. Model Loading | Load 3 fitted hazard models and learned event parameters | Reorientation, pause, reversal models; detection thresholds |
| 2. Condition Setup | Set stimulus schedule and initialize trajectory state | Intensity, duration, interval; initial position, speed, angle |
| 3. Time-Step Simulation | Calculate hazard rates, sample event times, update trajectory | Hazard rate calculation, event time sampling, state updates |
| 4. KPI Computation | Calculate 7 KPIs per replication, aggregate across 30 replications | KPI calculation, confidence intervals, across-replications summary |

Step 3 performs time-step simulation by calculating hazard rates from fitted models at each time point, sampling event times (turns, pauses, reversals) based on hazard probabilities, updating trajectory state (position, speed, angle) according to event occurrences, and recording all behavioral events with timestamps. Step 4 computes KPIs by calculating seven metrics per replication, aggregating across replications with statistical summaries, and generating confidence intervals for each KPI using standard error calculations.
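Step 3's event sampling can be sketched as discrete-time thinning of the hazard: in each small step dt, an event fires with probability 1 − exp(−h(t)·dt). The `hazard_fn` argument is a hypothetical stand-in for evaluating the fitted models on the current feature vector.

```python
import numpy as np

def simulate_events(hazard_fn, duration_s, dt=0.1, rng=None):
    """Sample event times from a time-varying hazard, step by step.

    P(event in [t, t + dt)) = 1 - exp(-h(t) * dt).  `hazard_fn` maps
    time (s) to a hazard rate (events/s); in the real simulation the
    rate comes from the fitted reorientation/pause/reversal models.
    """
    rng = rng or np.random.default_rng(0)
    times = np.arange(0.0, duration_s, dt)
    rates = np.asarray([hazard_fn(t) for t in times])
    p = 1.0 - np.exp(-rates * dt)              # per-step event probability
    return times[rng.random(len(times)) < p]   # timestamps of sampled events
```

With a constant hazard of h events/s over T seconds, the expected event count is roughly h·T, which gives a quick check that the thinning is calibrated.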

Key Performance Indicators

7 Metrics to Validate Model

The simulation computes seven KPIs per replication to validate model predictions against empirical observations. Turn rate measures reorientations per minute, calculated as total turns divided by total time in minutes. Latency quantifies time to first turn after stimulus onset, measured within the integration window relative to stimulus onset. Stop fraction represents the proportion of time spent paused, computed as total stop time divided by total simulation time.

\tech{Key Performance Indicators (KPIs)}
| KPI Name | Units | Calculation Method |
|---|---|---|
| Turn Rate | turns/min | total_turns / (total_time / 60) |
| Latency | seconds | min(turn_times) - stimulus_onset (within integration window) |
| Stop Fraction | dimensionless | total_stop_time / total_time |
| Pause Rate | pauses/min | total_pauses / (total_time / 60) |
| Reversal Rate | reversals/min | total_reversals / (total_time / 60) |
| Tortuosity | dimensionless | euclidean_distance / path_length |
| Mean Spine Curve Energy | dimensionless | mean(spine_curve_energy) |

Pause rate counts pauses per minute using speed-based detection with learned thresholds. Reversal rate measures heading reversals per minute, representing discrete large-angle turn events distinct from reverse crawling. Tortuosity quantifies path efficiency as the ratio of euclidean distance to path length, ranging from 0 (highly tortuous) to 1 (straight path). Mean spine curve energy captures body bend energy averaged over the trajectory. These KPIs capture different aspects of behavior, remain comparable to empirical observations, are computable from trajectory data, and hold biological meaning for larval navigation and stimulus response.
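Three of these KPIs follow directly from the formulas in the table above; the sketch below is illustrative (the 0.003 mm/s threshold mirrors the learned pause speed threshold, and the function name is an assumption):

```python
import numpy as np

def compute_kpis(turn_times, speed, x, y, total_time_s, stop_thresh=0.003):
    """Sketch of turn rate, stop fraction, and tortuosity.

    `stop_thresh` mirrors the learned pause speed threshold
    (0.003 mm/s); the real code computes all seven KPIs.
    """
    turn_rate = len(turn_times) / (total_time_s / 60.0)     # turns/min
    stop_fraction = float(np.mean(speed < stop_thresh))     # dimensionless
    path_length = np.sum(np.hypot(np.diff(x), np.diff(y)))
    euclid = np.hypot(x[-1] - x[0], y[-1] - y[0])
    # 0 = highly tortuous path, 1 = perfectly straight path.
    tortuosity = euclid / max(path_length, 1e-9)
    return {"turn_rate": turn_rate,
            "stop_fraction": stop_fraction,
            "tortuosity": tortuosity}
```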

Event-Hazard Models

How We Predict Behavioral Events

The event-hazard model describes the instantaneous probability of behavioral events occurring at time \(t\) conditioned on feature vector \(\mathbf{x}\):

\[h(t | \mathbf{x}) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \ldots)\]

where \(h_0(t)\) represents the baseline hazard and the exponential term captures feature effects. Model features include temporal kernel basis functions (raised cosine) encoding stimulus history, stimulus intensity effects scaling with PWM values, contextual features (speed, orientation, wall proximity) affecting event probabilities, and interaction terms capturing nonlinear relationships.
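Evaluating this proportional-hazards form, together with the raised-cosine kernels used to encode stimulus history, can be sketched as follows (all names and the kernel parameterization are illustrative; the fitted coefficients come from the actual model files):

```python
import numpy as np

def hazard_rate(t, x, baseline, beta):
    """h(t | x) = h0(t) * exp(beta . x), the form given above.

    `baseline` is a callable h0(t); `beta` and the feature vector `x`
    stand in for the fitted coefficients and engineered features.
    """
    return baseline(t) * np.exp(np.dot(beta, x))

def raised_cosine_basis(t, centers, width):
    """Raised-cosine temporal kernels over the integration window.

    Each kernel is 0.5 * (1 + cos(pi * (t - c) / width)) for
    |t - c| <= width and 0 outside; the clip enforces the support.
    """
    t = np.asarray(t, dtype=float)[:, None]
    arg = np.clip((t - np.asarray(centers, dtype=float)) / width, -1.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * arg))
```

Each kernel peaks at its center and falls smoothly to zero at the window edges, so a weighted sum of kernels yields a smooth time-varying stimulus-history effect.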

\tech{Hazard Model Feature Structure}
| Feature Type | Description | Model Representation |
|---|---|---|
| Temporal Kernel | Raised cosine basis functions over integration window | Basis expansion |
| Stimulus Intensity | PWM intensity values (250, 500, 1000) | Linear coefficient |
| Contextual Features | Speed, orientation, wall proximity | Linear coefficients |
| Interactions | Nonlinear feature combinations | Product terms |

Model fitting employs generalized linear models (Poisson for count events, logistic for binary events) with cross-validation using leave-one-larva-out methodology and regularization with \(\lambda = 0.1\) to prevent overfitting. The output provides coefficients for each feature quantifying effect sizes, time-varying hazard rates enabling event probability calculation at any time point, and event probability distributions for stochastic event sampling during simulation.
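A hedged sketch of this fitting scheme with scikit-learn: a Poisson GLM with an L2-style penalty, cross-validated by leaving one larva's rows out per fold. The data here are synthetic stand-ins; the real fitting lives in fit_hazard_models.py.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Synthetic stand-in: 200 rows of 4 engineered features, Poisson event
# counts driven by the first feature, grouped into 10 "larvae".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
counts = rng.poisson(np.exp(0.3 * X[:, 0]))
larva_id = np.repeat(np.arange(10), 20)     # group label per row

# alpha plays the role of the regularization strength (lambda = 0.1).
model = PoissonRegressor(alpha=0.1)

# Leave-one-larva-out: each CV fold holds out all rows of one larva.
scores = cross_val_score(model, X, counts, groups=larva_id,
                         cv=LeaveOneGroupOut())
model.fit(X, counts)                         # final fit on all data
```

Grouped cross-validation matters here because rows from the same larva are correlated; random row-level splits would overstate generalization.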

Parameter Learning

Calibrating Detection Thresholds

Event detection requires threshold parameters that cannot be directly observed but must be inferred from empirical data distributions. Pause detection requires a speed threshold defining when movement ceases and a minimum duration ensuring pauses represent meaningful behavioral states rather than transient speed fluctuations. Reversal detection requires an angle threshold distinguishing discrete heading reversals from gradual turns. MAGAT segmentation requires curvature and body bend thresholds for run/turn classification.

\tech{Parameter Learning Methodology}
| Parameter | Learning Method | Learned Value | Calibration Target |
|---|---|---|---|
| Pause Speed Threshold | Speed distribution percentiles | 0.003 mm/s | Match empirical pause rate (~4 pauses/min) |
| Pause Min Duration | Empirical pause duration analysis | 0.2 seconds | Filter transient pauses |
| Reversal Angle Threshold | Heading change distribution | 45-70 degrees | Distinguish reversals from turns |
| MAGAT Curvature Threshold | Curvature distribution percentiles | Data-driven percentile | Optimize run detection |
| MAGAT Body Bend Threshold | Body bend angle distribution percentiles | Data-driven percentile | Optimize run detection |

The parameter learning solution analyzes empirical data distributions to set thresholds. For pause detection, the system analyzes speed distributions, sets thresholds at low percentiles, and calibrates to match empirical pause rates (approximately 4 pauses per minute). For reversal detection, it analyzes heading change distributions and sets angle thresholds (typically 45-70 degrees) to distinguish reversals from gradual turns. For MAGAT segmentation, it analyzes curvature and body bend angle distributions, sets thresholds using percentile-based methods, and optimizes to ensure runs are correctly detected without excessive fragmentation.
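The percentile-based calibration can be sketched as below. The 5th-percentile choice and the synthetic speed distribution are illustrative assumptions; the actual percentile is chosen to match the empirical pause rate.

```python
import numpy as np

def percentile_threshold(values, pct):
    """Set a detection threshold at a low percentile of the empirical
    distribution, as done for the pause speed and MAGAT thresholds.
    The percentile choice is illustrative, not the project's value."""
    return float(np.percentile(values, pct))

# Synthetic speeds (mm/s): ~5% of frames are near-zero "paused" frames,
# the rest are clearly moving.  A 5th-percentile threshold should land
# between the two groups and flag exactly the paused frames.
rng = np.random.default_rng(1)
speeds = np.concatenate([rng.uniform(0.0, 0.002, 50),    # paused frames
                         rng.uniform(0.05, 0.5, 950)])   # moving frames
thresh = percentile_threshold(speeds, 5)
pause_frames = speeds < thresh
```

In practice the percentile would be tuned until the resulting pause rate matches the empirical ~4 pauses/min calibration target.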

Validation Results

Testing Before Full DOE

Pre-DOE validation employed a single-condition test with repeated replications to verify KPI calculations and compare values against empirical ranges. The test verified biologically plausible values across all metrics, confirmed that parameter learning produced consistent detection, and confirmed that the models capture stimulus-response dynamics.

\tech{Pre-DOE Validation Results}
| KPI | Empirical Range | Validation Status | Notes |
|---|---|---|---|
| Turn Rate | 0.4-11 turns/min | Within range | Validated against empirical turn distributions |
| Latency | 0.5-5 seconds | Within range | Measured relative to stimulus onset in integration window |
| Pause Rate | 1-10 pauses/min | Within range | Speed-based detection matches empirical pause patterns |
| Stop Fraction | 0-1 | Within range | Proportion of time below speed threshold |

Validation results demonstrate turn rates ranging from 0.4 to 11 turns per minute, matching empirical distributions. Latency values fall between 0.5 and 5 seconds when measured relative to stimulus onset within the integration window, confirming correct temporal alignment. Pause rates range from 1 to 10 pauses per minute using speed-based detection matching empirical pause patterns. Stop fraction values span 0 to 1, representing the proportion of time spent below the speed threshold. All values fall within expected biological ranges, parameter learning ensures consistency with empirical detection methods, and the models successfully capture stimulus-response dynamics observed in real larval behavior.

Simulation Execution

Running the Full DOE

The full DOE simulation executes through a structured process. The system loads three fitted hazard models (reorientation, pause, heading reversal) and learned event parameters defining detection thresholds. It iterates through all 45 conditions, executing 30 replications per condition with checkpointing after each condition completion to enable resume capability.

\tech{Simulation Execution Features}
| Feature | Description | Implementation |
|---|---|---|
| Checkpointing | Save progress after each condition | Checkpoint CSV updated after each condition |
| Progress Monitoring | Real-time progress display with replication counts | Monitor script updates every 2 seconds |
| Error Handling | Continue on individual errors with max threshold | Max 10 errors per condition, continue otherwise |
| Resume Capability | Restart from last checkpoint if interrupted | Checkpoint file contains completed replications |

Progress monitoring provides a real-time display showing total replications completed, conditions completed out of 45, progress percentage calculated from total replications (1350 total), remaining conditions count, and estimated time remaining based on observed replication rates. Error handling continues simulation on individual replication failures, tracks error counts per condition, and stops only if the error threshold (10 errors per condition) is exceeded. Resume capability enables restarting from the last checkpoint if the simulation is interrupted, loads completed replications from the checkpoint file, and continues from the next incomplete condition.
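Checkpoint-based resume might look like the following sketch. The file path and column names are assumptions, not the project's actual checkpoint schema.

```python
import csv
import os

def completed_conditions(path):
    """Read the checkpoint CSV and return the set of condition ids
    already finished, so an interrupted run resumes at the next
    incomplete condition.  Column name `condition_id` is assumed."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row["condition_id"] for row in csv.DictReader(f)}

def mark_done(condition_id, n_reps, path):
    """Append one row after a condition's replications all complete."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["condition_id", "replications"])
        if write_header:
            w.writeheader()
        w.writerow({"condition_id": condition_id, "replications": n_reps})
```

The main loop would then skip any condition whose id appears in `completed_conditions(...)` before starting its 30 replications.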

Expected Results

Analysis Pipeline

After simulation completion, the analysis pipeline performs three major operations. First, result aggregation exports data to Arena format, generates across-replications summary with mean, standard deviation, and confidence intervals for each KPI by condition, and creates formatted CSV files compatible with Arena analysis tools.
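The across-replications summary could be computed along these lines with pandas; the column names and the normal-approximation 95% interval are assumptions standing in for the actual AcrossReplicationsSummary.csv layout.

```python
import numpy as np
import pandas as pd

def across_replications_summary(df, kpi_cols, z=1.96):
    """Per-condition mean, std, and a normal-approximation 95% CI
    half-width for each KPI, mirroring the across-replications
    summary described above.  `condition_id` is an assumed column."""
    out = df.groupby("condition_id")[kpi_cols].agg(["mean", "std", "count"])
    for k in kpi_cols:
        # Half-width = z * standard error = z * std / sqrt(n).
        out[(k, "ci_halfwidth")] = (
            z * out[(k, "std")] / np.sqrt(out[(k, "count")])
        )
    return out
```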

\tech{Post-Simulation Analysis Pipeline}
| Analysis Step | Process | Output Files |
|---|---|---|
| 1. Aggregate Results | Export to Arena format, generate summary statistics | AcrossReplicationsSummary.csv with CIs |
| 2. Main Effects Analysis | ANOVA for each KPI, factor importance ranking | ANOVA results JSON, factor coefficients |
| 3. Visualization | Main effects plots, interaction plots, distributions | PNG plots for main effects and interactions |

Second, main effects analysis performs ANOVA for each KPI to test factor significance, ranks factors by importance based on effect sizes, and analyzes interaction effects between factor pairs. Third, visualization generates main effects plots showing KPI responses to each factor level, creates interaction plots displaying factor combination effects, and produces distribution comparisons across conditions. At the time of writing, the full DOE is running, with completion time estimated from observed replication rates.
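A simplified stand-in for the per-KPI factor test: one-way ANOVA of a single KPI across the levels of one factor, on synthetic data (the full analysis covers all KPIs, all factors, and their interactions).

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic turn-rate samples (turns/min), 30 replications per
# intensity level, with the mean rising with PWM intensity.
rng = np.random.default_rng(2)
turn_rate_by_intensity = {
    250:  rng.normal(2.0, 0.5, 30),
    500:  rng.normal(3.0, 0.5, 30),
    1000: rng.normal(4.5, 0.5, 30),
}

# One-way ANOVA: does mean turn rate differ across intensity levels?
F, p = f_oneway(*turn_rate_by_intensity.values())

# Main-effect summary: mean KPI at each factor level.
main_effect = {lvl: v.mean() for lvl, v in turn_rate_by_intensity.items()}
```

Repeating this per factor and per KPI, and comparing effect sizes, gives the factor-importance ranking; a two-way layout would add the pairwise interaction terms.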

Implementation Details

Technical Stack

The implementation employs Python 3 for data processing and simulation execution, R and Quarto for analysis and reporting, Pandas and NumPy for data manipulation, and Scikit-learn for model fitting. The data engineering script engineer_dataset_from_h5.py extracts features from H5 files and generates the engineered dataset. The model fitting script fit_hazard_models.py fits hazard models with cross-validation and regularization.

\tech{Implementation Scripts and Functions}
| Component | Script | Primary Function | Output Files |
|---|---|---|---|
| Data Engineering | engineer_dataset_from_h5.py | Feature extraction, event detection, Klein table generation | Engineered CSV dataset |
| Model Fitting | fit_hazard_models.py | Hazard model fitting with CV, coefficient estimation | Fitted model JSON files |
| Simulation | simulate_trajectories.py | DOE simulation with checkpointing, KPI computation | Simulation results CSV, checkpoint files |
| Analysis | analyze_doe_results.py | Arena export, ANOVA, visualization generation | Arena CSVs, ANOVA JSON, plot PNGs |

The simulation script simulate_trajectories.py executes DOE simulation with checkpointing and error recovery, computes KPIs for each replication, and generates results files. The analysis script analyze_doe_results.py exports results to Arena format, performs ANOVA analysis, and generates visualization plots. Output format includes Arena-style CSV files with across-replications summaries, summary statistics with confidence intervals for each KPI, formatted tables for reports, and visualization plots in PNG and PDF formats.

Next Steps

After Simulation Completes

Immediate steps following simulation completion include verifying that the all_results.csv structure contains all expected columns, confirming all seven KPIs are present without approach_rate (removed as inapplicable to omnidirectional stimuli), validating biologically plausible value ranges, and checking the expected number of rows (1350 total: 45 conditions × 30 replications).

\tech{Post-Simulation Workflow}
| Phase | Actions | Deliverables |
|---|---|---|
| Verification | Verify results CSV structure and KPI columns | Validated results file |
| Analysis | Run analysis pipeline, generate plots and ANOVA | Analysis outputs (CSVs, plots, ANOVA JSON) |
| Presentation | Update slides with actual results and visualizations | Updated presentation with results |
| Report | Generate final report with statistical interpretation | TermProject_Report.qmd with complete analysis |

Analysis activities involve running the complete analysis pipeline using analyze_doe_results.py, generating main effects plots showing factor importance, creating interaction plots revealing factor combinations, and reviewing KPI distributions for biological interpretation. Presentation updates will incorporate actual results into slides, add main effects visualizations with confidence intervals, include interaction plots demonstrating factor relationships, and discuss biological implications of the findings. Report generation will update TermProject_Report.qmd with analysis results, ensure all KPI references match the seven valid metrics, and re-render the report with complete statistical interpretation.

Questions?

Contact & Resources

The project repository contains all code and data in the ecs630/labs/termprojectproposal/ directory, with documentation in the docs/ subdirectory and simulation results in the output/ directory. Key files include TermProject_Proposal.qmd providing the full project proposal, POST_DOE_CHECKLIST.md containing the analysis workflow, and Simulation_Presentation.qmd representing this presentation.

\tech{Project Resources and Locations}
| Resource Type | Location | Contents |
|---|---|---|
| Project Repository | ecs630/labs/termprojectproposal/ | All code, data, and configuration files |
| Documentation | docs/ directory | Methodology documentation and checklists |
| Results | output/ directory | Simulation results, analysis outputs, plots |
| Monitoring | scripts/monitor_doe.py | Real-time progress monitoring script |

Simulation status can be checked by examining output/simulation_results/doe_run.log for execution progress, running python3 scripts/monitor_doe.py for a real-time monitoring display, and reviewing checkpoint files to verify completed replications. The monitoring script reports total replications completed, conditions completed out of 45, progress percentage, remaining conditions, estimated time remaining, and log file size, updating automatically every 2 seconds.